4 research outputs found

    ANETAC: Arabic named entity transliteration and classification dataset

    In this paper, we make freely accessible ANETAC, our English-Arabic named entity transliteration and classification dataset, which we built from freely available parallel translation corpora. The dataset contains 79,924 instances. Each instance is a triplet (e, a, c), where e is the English named entity, a is its Arabic transliteration, and c is its class, which can be Person, Location, or Organization. The ANETAC dataset is aimed mainly at researchers working on Arabic named entity transliteration, but it can also be used for named entity classification. The dataset was developed and used as part of a previous research study by Hadj Ameur et al. [1].
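The triplet structure (e, a, c) described above could be represented roughly as follows. This is a minimal sketch, not the dataset's official loader: the tab-separated layout, the upper-case class labels, the `Instance` and `parse_line` names, and the two sample rows are all assumptions made for illustration and are not taken from ANETAC itself.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed label spellings; the real dataset may encode classes differently.
CLASSES = {"PERSON", "LOCATION", "ORGANIZATION"}


@dataclass(frozen=True)
class Instance:
    english: str  # e: the English named entity
    arabic: str   # a: its Arabic transliteration
    label: str    # c: its class


def parse_line(line: str, sep: str = "\t") -> Instance:
    """Parse one 'e<sep>a<sep>c' line (hypothetical file layout)."""
    english, arabic, label = line.rstrip("\n").split(sep)
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label!r}")
    return Instance(english, arabic, label)


# Invented sample rows, for illustration only.
sample = ["London\tلندن\tLOCATION", "Ahmed\tأحمد\tPERSON"]
instances = [parse_line(row) for row in sample]
class_counts = Counter(inst.label for inst in instances)
```

Keeping the class as a separate field means the same records can serve both uses mentioned in the abstract: (e, a) pairs for transliteration and (e, c) or (a, c) pairs for classification.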

    Improving Arabic neural machine translation via n-best list re-ranking

    Even though the rise of the Neural Machine Translation (NMT) paradigm has brought a great deal of improvement to the machine translation field, current translation results are still not perfect. One of the main reasons for this imperfection is the complexity of the decoding task. Indeed, finding the single best translation in the space of all possible translations was, and still is, a challenging problem. One of the most successful ways to address it is n-best list re-ranking, which attempts to reorder the decoder's n-best translations according to a set of defined features. In this paper, we propose a set of new re-ranking features that can be extracted directly from the parallel corpus without any external tools. The proposed feature set takes into account lexical, syntactic, and even semantic aspects of the n-best list translations. We also present a method for feature weight optimization that uses a Quantum-behaved Particle Swarm Optimization (QPSO) algorithm. Our system has been evaluated on multiple English-to-Arabic and Arabic-to-English machine translation test sets, and the re-ranking results show noticeable improvements over the baseline NMT systems.
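The general idea of n-best list re-ranking can be sketched as a weighted linear combination of the decoder score and additional feature scores. This is only an illustration under stated assumptions: the `length_ratio` feature and the hand-picked weights are hypothetical stand-ins, not the features proposed in the paper, and the QPSO weight optimization is not shown.

```python
from typing import Callable, List, Sequence, Tuple

# A feature maps (source sentence, hypothesis) to a real-valued score.
Feature = Callable[[str, str], float]


def length_ratio(source: str, hypothesis: str) -> float:
    """Hypothetical feature: closeness of token counts, in [0, 1]."""
    s, h = len(source.split()), len(hypothesis.split())
    if s == 0 or h == 0:
        return 0.0
    return min(s, h) / max(s, h)


def rerank(
    source: str,
    nbest: Sequence[Tuple[str, float]],  # (hypothesis, decoder score)
    features: Sequence[Feature],
    weights: Sequence[float],            # weights[0] scales the decoder score
) -> List[Tuple[str, float]]:
    """Return the n-best list sorted by combined score, best first."""
    def combined(item: Tuple[str, float]) -> float:
        hypothesis, decoder_score = item
        extra = sum(w * f(source, hypothesis)
                    for w, f in zip(weights[1:], features))
        return weights[0] * decoder_score + extra

    return sorted(nbest, key=combined, reverse=True)


# Toy example: the decoder prefers "x", the re-ranker prefers "w x y z".
nbest = [("x", -1.0), ("w x y z", -1.2)]
reranked = rerank("a b c d", nbest, [length_ratio], [1.0, 2.0])
```

In this toy run the combined scores are -0.5 for "x" and 0.8 for "w x y z", so the second hypothesis moves to the top. In a real system the weights would be tuned on a development set rather than set by hand.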

    A POS-based preordering approach for English-to-Arabic statistical machine translation

    In this work, we present a POS-based preordering approach that tackles both long- and short-distance reordering phenomena. Syntactic unlexicalized reordering rules are automatically extracted from a parallel corpus using only word alignment and a source-side language tagging. The reordering rules are applied deterministically, which prevents the reordering procedure from becoming a bottleneck for decoding speed. A new approach for both rule filtering and rule application is used to ensure fast and efficient reordering. Tests performed on the IWSLT2016 English-to-Arabic evaluation benchmark show a noticeable increase in the overall BLEU score for our system over the baseline PSMT system.
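Deterministic application of unlexicalized POS reordering rules might look roughly like the sketch below. It assumes a rule is a POS-tag pattern mapped to a permutation of positions, applied left to right with the longest matching pattern winning. The rule table, the tag names, and the `preorder` function are hypothetical, and the paper's own rule extraction, filtering, and application procedure is not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

# A rule maps a POS pattern to a permutation of positions within the match.
Rules = Dict[Tuple[str, ...], Tuple[int, ...]]


def preorder(tokens: Sequence[str], tags: Sequence[str], rules: Rules) -> List[str]:
    """Reorder source tokens in one left-to-right pass, longest match first."""
    max_len = max((len(pattern) for pattern in rules), default=0)
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest pattern that fits at position i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            permutation = rules.get(tuple(tags[i:i + n]))
            if permutation is not None:
                output.extend(tokens[i + j] for j in permutation)
                i += n
                break
        else:
            # No rule matched here: copy the token unchanged.
            output.append(tokens[i])
            i += 1
    return output


# Hypothetical rule: move an adjective after its noun, since Arabic
# adjectives typically follow the noun they modify.
rules: Rules = {("JJ", "NN"): (1, 0)}
reordered = preorder(["the", "red", "car"], ["DT", "JJ", "NN"], rules)
```

Because each position is visited once and each lookup is a dictionary access, the pass runs in time linear in the sentence length for a fixed maximum pattern length, which fits the abstract's point that deterministic rule application keeps reordering cheap.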